Dyr og Data

Statistical thinking — simple linear models

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-10-04

Introduction

In this session we’ll start to think about modelling the relationships between two or more variables

By modelling the relationship we are suggesting a means by which the data we collected might have originated

We won’t be able to speak to causation in many cases because we have data collected as part of observational studies

But we can test hypotheses about the relationships between variables

Linear Regression

Linear regression

We need to be careful about terminology

All the models we consider on this course are linear models

The phrases

  • Linear regression
  • Simple linear regression
  • Multiple linear regression
  • Least-squares regression

all describe the same model

Gaussian linear model

The model is one where the observations of the response are individually distributed Gaussian (are Gaussian random variables) with

  1. expected value \(\mu_i\) — so each observation has its own mean…
  2. and constant variance — \(\sigma\) doesn’t have a subscript \(i\)
  3. \(g()\) is the identity link — it returns its input unchanged

\[\begin{align*} y_i & \sim \text{Gaussian}(\mu_i, \sigma) \\ g(\mu_i) & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_j x_{ji} \end{align*}\]
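As a concrete illustration, this model can be simulated in a few lines of R (a sketch with made-up parameter values, not data from this course):

```r
# Simulate from the Gaussian linear model with one predictor:
# mu_i = beta_0 + beta_1 * x_i (identity link), y_i ~ Gaussian(mu_i, sigma)
set.seed(42)                          # arbitrary seed for reproducibility
n <- 100
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1
x <- runif(n, min = 0, max = 10)
mu <- beta_0 + beta_1 * x             # each observation gets its own mean
y <- rnorm(n, mean = mu, sd = sigma)  # constant variance across observations
coef(lm(y ~ x))                       # estimates should be close to (2, 0.5)
```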

Gaussian linear model

Linear regression can be thought of as an extension of the ideas behind correlations

The Gaussian linear model is a far more powerful statistical tool, however

Correlations tell us about the strength of the linear relationship between two variables \(x\) and \(y\)

A Gaussian linear model provides an equation for the line of best fit placed through the data

Further, in the linear regression the roles played by \(x\) and \(y\) are different; we say the values of \(y\) depend, to some degree, on the values of \(x\) measured on the same entities

Hence \(x\) plays the role of the predictor variable and \(y\) is the response

One continuous predictor

Simple linear regression is a statistical model that assumes a linear relationship between a continuous response variable \(y\) and a single continuous predictor variable

Three major purposes of such models

  • to describe the linear relationship between \(y\) and \(x\)
  • to determine how much variation (uncertainty) in \(y\) can be explained by the relationship with \(x\)
  • to predict new values of \(y\) from new values of \(x\)

One continuous predictor

In this section we’ll consider the simplest case of a single predictor \(x\) & its relationship with \(y\)

A suitable model for this linear relationship is

\[\begin{align*} y_i & \sim \text{Gaussian}(\mu_i, \sigma) \\ g(\mu_i) & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 x_{1i} \end{align*}\]

The \(\beta_j\) are the model parameters

\(\beta_0\) is the intercept, the expected value of \(y\) when \(x\) is 0

\(\beta_1\) is often called the slope, it measures the rate of change in \(y\) for a unit change in \(x\)

Schematic of the linear regression line

Linear regression

We estimate the two unknown parameters in the model using a procedure known as least squares, where we minimise the Residual Sum of Squares \(\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\)
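A sketch with simulated data showing that numerically minimising the RSS recovers the same estimates as `lm()` (variable and object names here are illustrative):

```r
# Least squares "by hand": minimise RSS = sum((y - (b0 + b1 * x))^2)
set.seed(1)
x <- runif(50)
y <- 0.7 + 0.8 * x + rnorm(50)
rss <- function(b) sum((y - (b[1] + b[2] * x))^2)
fit_optim <- optim(c(0, 0), rss)   # general-purpose numerical minimiser
fit_lm <- lm(y ~ x)                # lm() solves the same problem exactly
rbind(optim = fit_optim$par, lm = unname(coef(fit_lm)))
```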

Linear regression

Estimates of parameters (\(\beta_j\)) are for the population based on the fit to our sample of data

Linear regression

# m is a model fitted with lm(); compute 95% Wald confidence intervals
coefs <- coef(m)
ses <- sqrt(diag(vcov(m)))
crit <- qt(0.975, df = df.residual(m))
upr <- coefs + (crit * ses)
lwr <- coefs - (crit * ses)

Data were 20 observations generated from the following model

\[\mu_i = 0.7 + 0.8x_i \;\;\;\; y_i \sim N(\mu_i, \sigma = 1)\]

Fitted model estimates are: \(\hat{\beta}_0\) = 0.32 and \(\hat{\beta}_1\) = 0.999

The estimated parameters are means & the uncertainty in the estimated values is captured by their standard errors

Confidence intervals for the estimates:

  • \(\beta_0 \pm t_{0.975} \mathrm{SE}_{\beta_0}\) = -0.401, 1.041
  • \(\beta_1 \pm t_{0.975} \mathrm{SE}_{\beta_1}\) = 0.741, 1.256
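The whole recipe can be reproduced; `confint()` computes the same intervals as the manual calculation with `qt()`. A sketch (an arbitrary seed, so the numbers will differ from those quoted):

```r
# 20 draws from mu_i = 0.7 + 0.8 * x_i with sigma = 1, then 95% CIs
set.seed(123)                      # arbitrary seed; estimates will differ
x <- runif(20)
y <- rnorm(20, mean = 0.7 + 0.8 * x, sd = 1)
m <- lm(y ~ x)
confint(m, level = 0.95)           # same as coefs +/- qt(0.975, df) * SE
```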

Equine colic

Data are from a study of survival in horses with colic (abdominal pain) admitted to The Royal Veterinary and Agricultural University, Copenhagen

  • pulse — heart rate measured in beats per minute,
  • pcv — packed cell volume or hematocrit, the volume % of red blood cells in blood

PCV can become elevated due to, inter alia, dehydration or pain

colic <- read_csv("data/horse-colic/colic.csv") |>
  janitor::clean_names()

colic_labs <- labs(y = "Pulse (bpm)", x = "Packed cell volume (%)")

colic |>
  ggplot(aes(x = pcv, y = pulse)) +
  geom_point() +
  colic_labs

Equine colic

We will explore the relationship between PCV and heart rate

colic_m <- lm(pulse ~ pcv, data = colic)

Equine colic — summary

summary(colic_m)

Call:
lm(formula = pulse ~ pcv, data = colic)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.276 -11.783  -3.696  10.275  56.101 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.78068    3.77830   2.589  0.00998 ** 
pcv          1.27542    0.09356  13.631  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.95 on 405 degrees of freedom
  (65 observations deleted due to missingness)
Multiple R-squared:  0.3145,    Adjusted R-squared:  0.3128 
F-statistic: 185.8 on 1 and 405 DF,  p-value: < 2.2e-16

Equine colic — permutation test

colic_null <- withr::with_seed(2, 
  colic |>
    specify(pulse ~ pcv) |>
    hypothesize(null = "independence") |>
    infer::generate(reps = 1000) |>
    calculate(stat = "slope")
)

obs_slope <- colic |>
    specify(pulse ~ pcv) |>
    calculate(stat = "slope")

colic_null |>
  visualize() +
  shade_p_value(obs_stat = obs_slope,
  direction = "two-sided")

Equine colic — summary

printCoefmat(coef(summary(colic_m)), digits = 4)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.78068    3.77830   2.589  0.00998 ** 
pcv          1.27542    0.09356  13.631  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Estimate column contains the estimated parameters
  • The intercept is 9.781, the heart rate (`pulse`) at a PCV level of 0%
  • For every increase of 1% PCV, heart rate increases by 1.275 bpm
  • Each row in the table is a statistical test of a parameter, with a null hypothesis \(\beta_j\) = 0
  • \(t\) is the test statistic for the test \(t_j = \frac{\hat{\beta}_j - 0}{\mathrm{SE}_{\hat{\beta}_j}}\)
  • What is the probability of observing a value as extreme as \(\hat{\beta}_j\) if \(\beta_j\) = 0?
  • Is the observed coefficient expected if there were no relationship between \(x\) and \(y\)?
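The t statistic and its p-value can be reproduced by hand from the estimate and its standard error (a sketch with simulated data, checked against `summary()`):

```r
# t test of a slope "by hand"
set.seed(7)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40)
m <- lm(y ~ x)
est <- coef(m)["x"]
se <- sqrt(diag(vcov(m)))["x"]
t_stat <- (est - 0) / se                    # H0: beta_1 = 0
p_val <- 2 * pt(abs(t_stat), df = df.residual(m),
                lower.tail = FALSE)         # two-sided p-value
c(t = unname(t_stat), p = unname(p_val))
```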

T tests

\(n\) - 2 degrees of freedom = 405

If the \(t\) statistic lies in either rejection region, reject H0 at the \(\alpha\) = 0.05 level

ANOVA table

Linear regression can be seen as a partitioning of the variance in the data into an amount explained by the variables in the model & unexplained (residual) variance

Mean squares can only be positive; even a variable unrelated to the response will explain some variance

Compare the ratio of mean squares (the SSq normalised by degrees of freedom) with an \(F\) distribution with 1 & 405 degrees of freedom

Analysis of Variance Table

Response: pulse
           Df Sum Sq Mean Sq F value Pr(>F)    
pcv         1  53380   53380     186 <2e-16 ***
Residuals 405 116344     287                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA table

\(F\) is the \(F\)-ratio, the ratio of the regression and residual variances

\[F = \frac{\sum\limits^n_{i=1}(\hat{y}_i - \bar{y})^2 / p}{\sum\limits^n_{i=1}(y_i - \hat{y}_i)^2 / [n-(p+1)]} = \frac{\mathrm{MS_{reg}}}{\mathrm{MS_{resid}}}\]

Probability of \(F\) greater than or equal to observed from \(F\)-distribution with \(p\) and \(n - (p + 1)\) degrees of freedom
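The \(F\)-ratio and its p-value can be computed directly from the fitted and residual values (a sketch with simulated data, checked against `anova()`):

```r
# F-ratio by hand for a model with p = 1 predictor
set.seed(2)
x <- runif(30)
y <- 1 + 0.5 * x + rnorm(30)
m <- lm(y ~ x)
n <- length(y); p <- 1
ms_reg <- sum((fitted(m) - mean(y))^2) / p          # regression mean square
ms_resid <- sum(resid(m)^2) / (n - (p + 1))         # residual mean square
f_ratio <- ms_reg / ms_resid
pf(f_ratio, df1 = p, df2 = n - (p + 1), lower.tail = FALSE)  # upper-tail p
```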

ANOVA table

Large values of F are evidence against the null hypothesis of no relationship

Reference distribution is \(\mathsf{F_{1,405}}\); the rejection region is entirely in the upper tail

R-Squared

\(R^2\) is a commonly reported measure of variance explained by the regression

\(R^2\) is the coefficient of determination, the ratio of the variance explained to the total variance

\[R^2 = \frac{\mathrm{SS_{reg}}}{\mathrm{SS_{reg} + RSS}} = 1 - \frac{\mathrm{SS_{resid}}}{\mathrm{SS_{total}}}\]

One problem with \(R^2\) is that it increases as you add predictors to the model, even if those predictors have no explanatory power

Adjusted R-squared

Adjusted \(R^2\) takes into account number of predictors in the model

\[R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SS_{resid}} / [n - (p + 1)]}{\mathrm{SS_{total}} / (n-1)}\]

Neither measure is a particularly useful summary; they don’t indicate how well the regression model will predict new observations

The effect size, the size of the model coefficients, is far more useful
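Both quantities drop out of the sums of squares directly (a sketch with simulated data, checked against `summary()`):

```r
# R^2 and adjusted R^2 from their definitions
set.seed(5)
x <- runif(25)
y <- 2 + x + rnorm(25)
m <- lm(y ~ x)
n <- length(y); p <- 1
ss_resid <- sum(resid(m)^2)
ss_total <- sum((y - mean(y))^2)
r2 <- 1 - ss_resid / ss_total
r2_adj <- 1 - (ss_resid / (n - (p + 1))) / (ss_total / (n - 1))
c(r_squared = r2, adj_r_squared = r2_adj)
```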

Assumptions of linear regression

Assumptions of linear regression

To fit the model, we don’t need to make any assumptions, beyond linearity

Statistical inference on the estimated parameters depends on a number of assumptions

  1. The linear model correctly describes the functional relationship between \(y\) and \(x\)
  2. The \(x_i\) are measured without error
  3. For any given value of \(x_i\), the sampled \(y_i\) values are independent with normally distributed errors
  4. Variances are constant along the regression line/model

Assumptions

The linear model correctly describes the functional relationship between \(y\) and \(x\)

Effectively, this assumption is about whether the model we’ve fitted represents the true underlying relationship between \(x\) and \(y\)

Is the relationship well approximated via a straight line?

Would a curved line (polynomial) be better?

If we have the model form wrong, our model will be biased; fitted values will differ appreciably from the true values

If violated, the estimate of the residual variance (\(\sigma^2\)) will be inflated

Incorrect model specification can show itself as patterns in the residuals

Assumptions of the linear model I

df <- data.frame(Residuals = resid(colic_m), Fitted = fitted(colic_m))
ggplot(df, aes(x = Fitted, y = Residuals)) + geom_point()

Assumptions of the linear model II

\(x_i\) are measured without error

This assumption states that we know the values of the predictor variable(s) \(x\) exactly

Allows us to isolate the error component as random variation in \(y\)

Estimates \(\hat{\beta}_j\) will be biased if there is error in \(x\)

This is often ignored in data analysis, but modern, advanced approaches can account for this extra source of variation

Assumptions of the linear model III

For any \(x_i\), the sampled \(y_i\) are independent Gaussian random variables

Variances are constant along the regression line/model

These two assumptions relate to the distribution of the residuals (\(\varepsilon_i\)) or, conditional upon the values of \(x\), of \(y\)

\(\varepsilon_i\) are assumed to follow a Normal distribution with zero mean and constant variance

Assumptions of the linear model III

Independence and normality of errors allows us to use parametric theory for confidence intervals and hypothesis tests on the \(F\)-ratio.

Allows a single constant variance \(\sigma^2\) for the variance of the regression line/model

Each residual is drawn from the same distribution; Normal with mean zero, variance \(\hat{\sigma}^2\)

Non-constant variances can be recognised through plots of residuals (among others); residuals get wider as the values of \(y\) increase

Assumptions of the linear model III

# QQ plot & density plot of residuals; p1 + p2 combines them with {patchwork}
p1 <- ggplot(df, aes(sample = Residuals)) +
    stat_qq() +
    xlab("Theoretical Quantiles") +
    ylab("Standardized Residuals")
p2 <- ggplot(df, aes(x = Residuals)) + geom_density()
p1 + p2

Assumptions of the linear model III

df <- df |> mutate(SL = sqrt(abs(rstandard(colic_m))))
ggplot(df, aes(x = Fitted, y = SL)) + geom_point() +
    ylab(expression(sqrt(abs(standardised ~ residuals))))

Assumptions of the linear model IV

Independence is a key assumption; knowing the value of one observation tells us nothing about another beyond their relationship through \(x\)

Data that have spatial or temporal components, or represent repeated observations on the same set of individuals are commonly encountered & violate this assumption

Outliers, Leverage & Influence

Outliers, Leverage & Influence

An outlier is an observation which is inconsistent with the rest of a sample

An observation can be an outlier due to the response variable(s) or one or more of the predictor variables having values outside their expected limits

An outlier may also result from: incorrect measurement, incorrect data entry, transcription error, recording error

Two main concepts

  • Leverage: Potential for an outlier to be influential
  • Influence: Observation is influential if its deletion substantially changes the results

Outliers, Leverage & Influence

Leverage measures the degree to which individual observations affect the fitted value for that observation

Leverage values are also known as hat values, as they are the values on the diagonal of the hat matrix, which projects the observed values onto the fitted values of the model

Hat matrix is so called because it puts a hat on \(\mathbf{Y}\): \(\mathbf{\hat{Y} = HY}\)

Leverage ranges from 1/n to 1

Observation has high leverage if its hat value is 2–3 times the average hat value: \(h = (p+1)/n\), where \(p+1\) is number of coefficients (inc. the intercept)
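A sketch of this rule of thumb using `hatvalues()` (simulated data with one deliberately extreme \(x\) value):

```r
# Hat values (leverage) and the 2-3x average rule of thumb
set.seed(3)
x <- c(runif(20), 5)               # last point is far out in x: high leverage
y <- 1 + x + rnorm(21)
m <- lm(y ~ x)
h <- hatvalues(m)
h_bar <- (1 + 1) / length(y)       # average hat value (p + 1) / n
which(h > 3 * h_bar)               # flags the extreme observation
```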

Outliers, Leverage & Influence

colic_f <- broom::augment(colic_m)
colic_f <- mutate(colic_f, Observation = seq_along(.hat))
ggplot(colic_f, aes(x = Observation, y = .hat)) +
    geom_bar(stat = "identity") +
    geom_hline(aes(yintercept = 3 / nrow(colic_f))) +
    ylab("Hat values")

Outliers, Leverage & Influence

An observation that combines outlyingness with high leverage exerts an influence on the estimated regression coefficients

If an influential observation is deleted from the model, the estimated coefficients change substantially

\(\mathsf{dfbeta}_{ij}\) assesses the impact on the \(j\)th coefficient of deleting the \(i\)th observation

\[\mathsf{dfbeta}_{ij} = \beta_{j(-i)} - \beta_j\]

The \(\mathsf{dfbeta}_{ij}\) are expressed in the metric of the coefficient

A standardised version, \(\mathsf{dfbetas}_{ij}\) divides \(\mathsf{dfbeta}_{ij}\) by the standard error of \(\beta_j\)

Influential observations have \(\mathsf{dfbetas}_{ij} \geq 2 / \sqrt{n}\)
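R computes these with `dfbeta()` and `dfbetas()`; a sketch with simulated data, using the cutoff quoted above:

```r
# dfbeta / dfbetas: impact of deleting each observation on each coefficient
set.seed(4)
x <- runif(30)
y <- 1 + x + rnorm(30)
m <- lm(y ~ x)
db <- dfbetas(m)                   # standardised: dfbeta / SE of beta_j
which(abs(db[, "x"]) >= 2 / sqrt(nrow(db)))  # potentially influential rows
```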

Outliers, Leverage & Influence: Cook’s Distance

One problem with \(\mathsf{dfbetas}_{ij}\) is that there are so many numbers!

One for each observation for every \(\beta_j\): \(n \times (p+1)\)

Cook’s Distance, \(D_i\), is a scale-invariant measure of the distance between \(\beta_j\) and \(\beta_{j(-i)}\)

\[D_i = \frac{e^2_i}{s^2(p+1)} \times \frac{h_i}{(1-h_i)^2}\]

where \(e_i^2\) is the squared residual & \(h_i\) is the hat value for \(x_i\); \(s^2\) is the variance of the residuals

The first fraction is a measure of outlyingness, the second of leverage

\(D_i \geq 4 / (n - p - 1)\) suggested as a cutoff for high values of \(D_i\)
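A sketch computing \(D_i\) from residuals, leverage, and the residual variance, checked against R’s `cooks.distance()` (note that with raw residuals \(e_i\) the leverage denominator is squared):

```r
# Cook's Distance from its components
set.seed(6)
x <- runif(30)
y <- 1 + x + rnorm(30)
m <- lm(y ~ x)
p <- 1
e <- resid(m)                      # raw residuals
h <- hatvalues(m)                  # leverage
s2 <- sum(e^2) / df.residual(m)    # residual variance
D <- (e^2 * h) / (s2 * (p + 1) * (1 - h)^2)
which(D >= 4 / (length(e) - p - 1))  # suggested high-influence cutoff
```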

Outliers, Leverage & Influence: Cook’s Distance

ggplot(colic_f, aes(x = Observation, y = .cooksd)) +
    geom_bar(stat = "identity") +
    geom_hline(aes(yintercept = 4 / (nrow(colic_f) - 2))) +
    ylab("Cook's Distance")

Outliers, Leverage & Influence: Example

Multiple Regression

Multiple regression

The simple regression model readily generalises to the situation where we have \(m\) predictors not just one

\[\begin{align*} y_i & \sim \text{Gaussian}(\mu_i, \sigma) \\ g(\mu_i) & = \eta_i \\ \eta_i & = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_j x_{ji} \end{align*}\]

Now we have \(m + 1\) parameters to estimate, one for the intercept and one each for the \(m\) predictors \(x_m\). We can have as many as \(m = n - 1\) predictor variables

The simple description as a regression line starts to blur; with \(m = 2\) we have a regression plane

But in other respects the model fitting, assumptions, etc. remain the same

We do now have the problem of deciding which of the \(m\) predictors are related to \(y\)
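Under the hood the estimates still come from least squares, now in matrix form, \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}\); a sketch with two simulated predictors:

```r
# Multiple regression as a matrix problem (normal equations)
set.seed(8)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)
X <- cbind(1, x1, x2)              # model matrix: intercept + 2 predictors
beta_hat <- solve(crossprod(X), crossprod(X, y))
cbind(by_hand = as.vector(beta_hat),
      lm = unname(coef(lm(y ~ x1 + x2))))
```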

Multiple regression: horse colic

Several additional variables were measured in the horse colic study

A number of predictor variables were thought to affect the heart rate (pulse) of animals admitted with colic

  • Temperature (temp) of each horse
  • Capillary refill time (crt)
  • Standard base excess (sbe)
  • Packed cell volume (pcv)

Multiple regression: horse colic

Typical statistical output for the full model

colic2 <- colic |>
  filter(complete.cases(colic))
colic_mr <- lm(pulse ~ pcv + sbe + temp + crt, data = colic2)
colic_mr_0 <- lm(pulse ~ 1, data = colic2)

print(summary(colic_mr), digits = 3)

Call:
lm(formula = pulse ~ pcv + sbe + temp + crt, data = colic2)

Residuals:
   Min     1Q Median     3Q    Max 
-41.52  -9.87  -2.54   8.33  59.12 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -156.840     46.520   -3.37  0.00083 ***
pcv            0.679      0.105    6.45  3.6e-10 ***
sbe           -0.751      0.170   -4.43  1.3e-05 ***
temp           4.559      1.230    3.71  0.00024 ***
crt            5.225      0.715    7.31  1.8e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.5 on 352 degrees of freedom
Multiple R-squared:  0.479, Adjusted R-squared:  0.473 
F-statistic:   81 on 4 and 352 DF,  p-value: <2e-16

Multiple regression: horse colic

Each t test is only a reflection of what would happen if you removed that variable from the model, leaving the others in the model

The joint F test is also significant, indicating that the predictors together explain variation in pulse

print(anova(colic_mr_0, colic_mr), digits = 4)
Analysis of Variance Table

Model 1: pulse ~ 1
Model 2: pulse ~ pcv + sbe + temp + crt
  Res.Df    RSS Df Sum of Sq     F Pr(>F)    
1    356 142862                              
2    352  74393  4     68469 80.99 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple regression: horse colic

Sequential sums of squares (Type I), ordering matters & changes results

print(anova(colic_mr), digits = 4)
Analysis of Variance Table

Response: pulse
           Df Sum Sq Mean Sq F value   Pr(>F)    
pcv         1  43826   43826  207.37  < 2e-16 ***
sbe         1  10227   10227   48.39 1.71e-11 ***
temp        1   3125    3125   14.78 0.000143 ***
crt         1  11291   11291   53.43 1.82e-12 ***
Residuals 352  74393     211                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple regression: horse colic

Marginal sums of squares (Type II): each variable is tested after all the others, so ordering doesn’t matter

print(car::Anova(colic_mr, type = "II"), digits = 4)
Anova Table (Type II tests)

Response: pulse
          Sum Sq  Df F value   Pr(>F)    
pcv         8800   1   41.64 3.63e-10 ***
sbe         4145   1   19.61 1.27e-05 ***
temp        2903   1   13.74 0.000244 ***
crt        11291   1   53.43 1.82e-12 ***
Residuals  74393 352                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of Variance

Analysis of Variance

Analysis of Variance (ANOVA) is the classic name for a linear model where the predictor (explanatory) variables are categorical

Earlier, ANOVA was used to partition the variance in \(y\) into components explained by the \(x_j\) & a residual component not explained by the regression model

A slightly more restricted view of ANOVA is that it is a technique for partitioning the variation in \(y\) into that explained by one or more categorical predictor variables

The categories of each factor are the groups or experimental treatments

Analysis of Variance

ANOVA considers the different sources of variation that might arise in a data set

Of particular interest are the differences in the mean value of \(y\) between groups

We can think of within-group and between-group variances

  • Between-group variance is that due to the treatment (group) effects
  • Within-group variance is that due to the variability of individuals & measurement error

There will always be variation between individuals, but is this within-group variance large or small relative to the variance between groups?

ANOVA how many ways?

One of the complications surrounding ANOVA is the convoluted nomenclature used to describe variants of the method

Variants are commonly distinguished by the number of categorical variables in the model

  • One-way ANOVA: contains a single categorical variable
  • Two-way ANOVA: contains two categorical variables
  • Three-way ANOVA: contains three categorical variables

Two-way and higher ANOVA potentially involve the consideration of factor—factor interactions

One-way ANOVA

In a one-way ANOVA we have a single categorical variable \(x\) with two or more levels. With two levels we have the same analysis as the t test

If we consider differences between animals of different breed, we might use a breed factor whose levels might be

  • Danish Holstein,
  • Red Danish
  • Jersey

If we’re testing the effect of parity, the factor might be parity with levels: 1, 2, 3, & 4+

One-way ANOVA

Assume we have a single categorical variable \(x\) with three levels. The One-way ANOVA model using dummy coding or treatment contrasts is

\[y_i = \beta_0 + \beta_1D_{i1} + \beta_2D_{i2} + \varepsilon_i\]

Where \(D_{ij}\) is the coding for the \(j\)th level (group) for the \(i\)th observation

Group \(D_1\) \(D_2\)
1 0 0
2 1 0
3 0 1
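R builds exactly this coding via `model.matrix()` with its default treatment contrasts (a minimal sketch):

```r
# Dummy (treatment-contrast) coding for a 3-level factor
grp <- factor(rep(c("1", "2", "3"), each = 2))
X <- model.matrix(~ grp)           # columns: intercept, D1 (grp2), D2 (grp3)
unique(X)                          # one row per group, matching the table
```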

One-way ANOVA

Measure the between-group variance as the regression sums of squares

The within-group variance is the residual sums of squares

An omnibus test F statistic is used to test the null hypothesis of no differences among population group means

\[\mathsf{H_0:} \; \beta_1 = \beta_2 = 0\]

One-way ANOVA

Low & colleagues (2016, PLOS One) conducted a study to examine the effects of two different anesthetics on aspects of mouse physiology.

  • 12 mice anesthetised with isoflurane
  • 11 mice anesthetised with alpha chloralose

Blood CO2 levels were recorded after 120 minutes as the response variable

One-way ANOVA

mice <- read_csv("data/mice-anesthesia/mice-anesthesia.csv") |>
  mutate(anesth = fct_recode(anesth, "Isoflurane" = "iso", "Alpha chloralose" = "ac")) |>
  rename(anesthetic = anesth)
mice_labs <- labs(x = "Anesthetic", y = expression(CO[2]))
mice |>
  ggplot(aes(y = co2, x = anesthetic)) +
  geom_boxplot() +
  geom_point(position = position_jitter(width = 0.1), aes(colour = anesthetic)) +
  guides(colour = "none") +
  mice_labs

One-way ANOVA

Results of fitting one-way ANOVA to the anesthesia data

mice_m <- lm(co2 ~ anesthetic, data = mice)
anova(mice_m)
Analysis of Variance Table

Response: co2
           Df Sum Sq Mean Sq F value   Pr(>F)   
anesthetic  1 2509.1 2509.09  9.5647 0.005515 **
Residuals  21 5508.9  262.33                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One-way ANOVA

Multiple comparisons

mice_m |>
  avg_predictions(by = "anesthetic", hypothesis = "pairwise")

                          Term Estimate Std. Error     z Pr(>|z|)   S 2.5 %
 Isoflurane - Alpha chloralose    -20.9       6.76 -3.09  0.00198 9.0 -34.2
 97.5 %
  -7.66

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response